POSEIDON: Peptidic Objects SEquence-based Interaction with cellular DOmaiNs: a new database and predictor

Cell-penetrating peptides (CPPs) are short chains of amino acids that have shown remarkable potential to cross the cell membrane and deliver coupled therapeutic cargoes into cells. Designing and testing different CPPs to target specific cells or tissues is crucial to ensure high delivery efficiency and reduced toxicity. However, in vivo and in vitro testing of various CPPs can be both time-consuming and costly, which has led to interest in computational methodologies, such as Machine Learning (ML) approaches, as faster and cheaper methods for CPP design and uptake prediction. Most ML models developed to date focus on classification rather than regression techniques because of the lack of informative quantitative uptake values. To address these challenges, we developed POSEIDON, an open-access and up-to-date curated database that provides experimental quantitative uptake values for over 2,300 entries and physicochemical properties of 1,315 peptides. POSEIDON also records additional annotations, such as cell line, cargo, and sequence, among others. By leveraging this database along with cell line genomic features, we processed a dataset of over 1,200 entries to develop an ML regression CPP uptake predictor. Our results demonstrated that POSEIDON accurately predicted peptide cell line uptake, achieving a Pearson correlation of 0.87, a Spearman correlation of 0.88, and an r² score of 0.76 on an independent test set. With its comprehensive and novel dataset and its predictive capabilities, the POSEIDON database and its associated ML predictor represent a substantial step forward in CPP research and development. The POSEIDON database and ML predictor are freely available through a user-friendly interface at https://moreiralab.com/resources/poseidon/, making them valuable resources for advancing research on CPP-related topics.
Scientific Contribution Statement: Our research addresses the critical need for more efficient and cost-effective methodologies in Cell-Penetrating Peptide (CPP) research. We introduced POSEIDON, a comprehensive and freely accessible database that delivers quantitative uptake values for over 2,300 entries, along with detailed physicochemical profiles for 1,315 peptides. Recognizing the limitations of current Machine Learning (ML) models for CPP design, our work leveraged the rich dataset provided by POSEIDON to develop a highly accurate ML regression model for predicting CPP uptake.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-024-00810-7.


Introduction
The biomedical field faces a significant challenge in the development of pharmacological compounds that can be efficiently delivered to binding sites. Cell-Penetrating Peptides (CPPs) provide a safe and effective means of delivering therapeutic agents and other cargoes into cells without causing damage to the cell membrane. Such cargo may include nucleic acids, proteins, peptides, nanoparticles, fluorophores, small therapeutic compounds, and peptide nucleic acids [1-4]. CPPs share common structural and physicochemical features, including short amino acid sequences consisting of 4-40 residues, which typically adopt α-helical structures [1,5-7]. They are often amphiphilic or cationic, soluble in water, partially hydrophobic, and rich in arginine and lysine residues [6,8-10].
CPPs have been extensively studied for their potential use as drug delivery systems and diagnostic tools in various medical areas, such as immunotherapy [11], neurological disorders [12], and cancer [13]. Although the number of clinical trials involving CPPs has increased, only one CPP has been approved by the European Medicines Agency (EMA) [1,14]. The design and testing of different CPPs in vitro and in vivo can be expensive and labor-intensive [15,16]. Therefore, efficient computational tools and methodologies are necessary for rapid and accurate identification of suitable CPPs. Recently, many computational resources have been used to provide information on CPP design and uptake ability, including Machine Learning (ML) approaches such as C2Pred [17], CPPred-RF [18], SkipCPP-Pred [19], CellPPD-MOD [20], ML-based prediction of CPP (MLCPP) [21,22], Kernel Extreme Learning Machine-based prediction (KELM-CPPpred) [23], and StackCPPred [24]. However, existing methods rely solely on classification approaches because the data available in current databases are largely qualitative. One of the most commonly used databases, CPPsite 2.0, published in 2016, contains qualitative data for over 1,800 CPP sequences [2].
We created POSEIDON (Peptidic Objects SEquence-based Interaction with cellular DOmaiNs), a comprehensive database containing quantitative uptake values and physicochemical properties of 1,315 cell-penetrating peptides across various scenarios, to fill gaps in current CPP design. POSEIDON is the most extensive database of quantitative CPP uptake values, with up-to-date information and unique data collection. Furthermore, POSEIDON includes a processed dataset built with a well-designed methodological approach, making it an ideal benchmark for the development of new ML algorithms. By leveraging this database, coupled with cell line genomic features, we developed a novel ML regression model that accurately predicted CPP uptake efficiency.

Data extraction and curation
The general workflow for data collection is shown in Fig. 1, which depicts the collection, organization, and extraction of accurate and relevant information from various sources to create a centralized and annotated database. CPP sequences and associated features were first collected from the CPPsite 2.0 database [2]. This first dataset comprised 1,855 entries, each corresponding to a CPP and its features. The information retrieved from this database included the CPP identifier, its name, and corresponding sequence, along with information such as PubMed IDs, cell lines used in the study, and cargo coupled to the CPP. All scientific articles referenced in CPPsite 2.0 were manually curated to fill POSEIDON with CPP quantitative uptake values and their respective units. Uptake values were recorded when quantitative data were available in plots or when they were directly mentioned by the authors. In addition, the temperature, concentration, time for CPP incubation, and uptake evaluation methods from the referenced articles were manually annotated. Only peptides with quantitative information were retained in the dataset, reducing the number of curated entries to 906, which corresponds to 676 unique CPPs.
Subsequently, we conducted a thorough literature search to supplement the database with manually curated samples. This process involved extensive and careful examination of relevant publications to identify additional data points. To this end, another 228 CPP-related articles, retrieved from PubMed using the query "((((CPP) AND (Cell Penetrating Peptide)) OR (Cell-penetrating Peptide)) AND (Cellular Uptake)) AND (("2015/11/19"[Date-Publication]:"2022/08/01"[Date-Publication]))", were evaluated, and quantitative experimental information was added when available. The final database comprised 2,371 entries, of which 1,315 were unique CPPs and 1,056 were CPPs with different uptake conditions. The latter refers to peptides that were tested repeatedly under different conditions, such as varying cargoes, cell lines, temperatures, or incubation times, to analyze the uptake capacity of a peptide under each condition.
To develop a suitable ML approach, it was necessary to refine the dataset to ensure the uniformity of the target variable (Uptake) in units, values, and experimental determination approaches. Several steps were performed to obtain a benchmark dataset for ML training and testing.
The POSEIDON original dataset and the ML predictor dataset are available at the following GitHub repository: https://github.com/MoreiraLAB/poseidon/tree/main/data. These datasets are stored under the names "CPP_dataset.csv" and "CPP_ML.csv", respectively.

Feature extraction
To prepare the dataset for ML, the POSEIDON pipeline incorporates various features that aim to characterize peptides, cell lines, and experimental conditions.
The peptide features can be further classified into three subcategories.
• Whole-peptide features were obtained using the Peptides R package [25].
• In-house position one-hot encoding features based on the size of the longest peptide. One-hot encoding is a reliable and interpretable method for representing categorical data such as amino acids in peptides [26,27]. It is compatible with traditional ML algorithms, is robust to data variations, and minimizes information loss.
• Annotation-based features, in which the sequence anomaly type and location were substituted with the closest amino acids (Additional file 1: Table S1).
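The position one-hot encoding in the list above can be illustrated with a minimal sketch. It is not the POSEIDON implementation: the 20-letter alphabet, the function names, and the zero-padding of peptides shorter than the longest one are assumptions made for illustration, and the sketch presumes that non-standard residues have already been substituted as described for the annotation-based features.

```python
# Minimal sketch of position-wise one-hot encoding (illustrative, not the POSEIDON code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def one_hot_encode(sequence, max_length):
    """Encode a peptide as a flat 0/1 vector of length max_length * 20.

    Position p and residue r map to index p * 20 + AMINO_ACIDS.index(r).
    Positions beyond the end of a shorter peptide remain all zeros.
    """
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    vector = [0] * (max_length * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence.upper()):
        column = AMINO_ACIDS.find(residue)
        if column == -1:
            # Assumption: non-standard residues were already replaced upstream.
            raise ValueError(f"unsupported residue {residue!r} at position {position}")
        vector[position * len(AMINO_ACIDS) + column] = 1
    return vector


def encode_dataset(sequences):
    """Encode every peptide, using the longest peptide to set the vector width."""
    max_length = max(len(sequence) for sequence in sequences)
    return [one_hot_encode(sequence, max_length) for sequence in sequences]


# Two illustrative sequences, 8 and 13 residues long.
peptides = ["RRRRRRRR", "GRKKRRQRRRPPQ"]
encoded = encode_dataset(peptides)
print(len(encoded[0]), sum(encoded[0]), sum(encoded[1]))
```

Because every peptide is padded to the length of the longest one, each column keeps a fixed meaning (a given residue at a given position), which is what makes this representation easy to interpret, at the cost of a wide and sparse feature matrix.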
Cell line features (736 in total) were obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) [28] database and matched with the cell lines of the POSEIDON dataset. Each cell line was then tagged as a true match if it was present in the GDSC. The POSEIDON dataset contained 43 cell lines available in the GDSC (Additional file 1: Table S2).
Finally, the experimental conditions were characterized using several variables (71 in total), including concentration (μM), categorical temperature (°C), incubation time availability and duration (in minutes), and curated cargo to avoid repetition (Additional file 1: Table S3). Prior to dimensionality reduction, this added up to 2,908 features (Table 1).

Data pre-processing and statistics treatment
Data cleaning, visualization, selection, and preprocessing of the raw dataset were performed using the programming language R (version 4.1.0) [29]. Peptides with unknown uptake values were excluded from the final dataset, as the methodologies used in these studies did not quantitatively measure peptide internalization. The resulting dataset consisted of 2,371 entries with quantitative values, varying units, and uptake-evaluation techniques.
Subsequently, statistical analysis of the data was performed using RStudio (version 1.4.1717) [30]. The tidyverse package (version 1.3.1), which includes dplyr for data manipulation and ggplot2 for data visualization [31], was used for the data analysis.
The dataset underwent several uniformization steps such as incubation time uniformization, temperature encoding, valid peptide sequence generation, and curation of the target variable (peptide uptake) in log10 form, as it provides a more comprehensible scale.
Feature extraction was performed as described, resulting in 1,330 usable features after features with null variance were removed. Each retained feature can be fully explained and linked to real information, as depicted on the website. A random 70-30 data split was performed, and the data were normalized using the mean and standard deviation of the training set, which were applied to both the training and test sets. The decision to retain dimensionality without reduction was supported by several factors: the sample size of the dataset, the relevance of domain-specific features, the robust performance of the model on an independent test set encompassing 30% of the total data, the need for transparency to facilitate interpretability, and the model's evident ability to withstand overfitting despite its high dimensionality. Notably, this high dimensionality was driven by the inclusion of relevant one-hot encoding features that accounted for 98% of the feature space.
Notes to Table 1 (experimental condition features):
• Categorical temperature (°C): although it is possible to use a numerical variable, there are only five available temperatures with biological relevance. For example, 37 °C is the regular human body temperature and 25 °C is a common room environment. For these reasons, and because in some cases there is no temperature information available, the temperature was categorically encoded.
• Incubation time and duration (min): 2 features.
• Annotated cargo: 63 features. Cargo was manually curated in several steps of the dataset. Initially, only cargoes annotated in the original research papers were considered. Additionally, while processing the dataset, position-independent additions were considered as cargoes.
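The split and normalization steps described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline (which was written in R): the function names, the toy data, the fixed random seed, the use of the population standard deviation, and the choice to drop features that are constant in the training set are all hypothetical.

```python
import math
import random


def split_rows(rows, targets, test_fraction=0.3, seed=0):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation on the training set only."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(column) / n for column in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in column) / n)
            for column, m in zip(columns, means)]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with training statistics; drop zero-variance features."""
    keep = [j for j, s in enumerate(stds) if s > 0]
    return [[(row[j] - means[j]) / stds[j] for j in keep] for row in rows]


# Toy data: three features (the second is constant) and a positive uptake value.
rows = [[float(i), 1.0, float(i % 3)] for i in range(10)]
uptake = [10.0 ** (i / 5) for i in range(10)]
log_uptake = [math.log10(value) for value in uptake]  # target in log10 form

x_train, x_test, y_train, y_test = split_rows(rows, log_uptake)
means, stds = fit_scaler(x_train)
x_train_scaled = apply_scaler(x_train, means, stds)
x_test_scaled = apply_scaler(x_test, means, stds)  # same statistics, no refitting
```

Fitting the scaler on the training rows only, and reusing those statistics for the test rows, keeps information about the test set from leaking into preprocessing.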

POSEIDON front-end implementation
A web server freely available to the scientific community can be found at https://moreiralab.com/resources/poseidon/. The web server was constructed using the Nginx web server with a Linux operating system. To develop the web interface, Flask [36] was used as the back end, and HTML, CSS, and JavaScript were applied as the front end in conjunction with Plotly [37] for dynamic plot visualization. Upon navigating to the POSEIDON platform, users are greeted with an intuitive interface designed to facilitate the submission of peptide sequences for prediction. Detailed instructions are provided on the homepage to guide users through the input process. This involves the following steps:
• Users input peptide sequence(s) into a designated text field within the interface.
• After entering the sequence, users can customize properties, such as peptide concentration, incubation time, temperature, and cell line type.
• Users are required to provide a valid email address to which the prediction results will be sent.
• To initiate the prediction process, users must click the 'Submit' button.
Data and associated code underpinning the analyses presented herein are accessible via the repository at https://github.com/MoreiraLAB/poseidon.

Database description
The POSEIDON database is a unique collection of recent information on CPPs, including quantitative cellular uptake values that have been experimentally obtained for each peptide. In addition to including all peptides in the CPPsite 2.0 database for which experimental quantitative cellular uptake data are available, POSEIDON has been highly enriched with up-to-date mining of the available literature.
A dataset of 2,371 entries was obtained through several steps of data acquisition and preprocessing, providing information about uptake evaluation methods, uptake conditions (such as temperature, cell line, and time of CPP incubation), uptake values, uptake units, cargoes, and peptide sequence. Both the CPPsite 2.0 and POSEIDON databases share information on peptide sequences, characteristics, modifications, validation methods, and cargo types. However, POSEIDON stands out because it offers quantitative uptake values for CPPs, whereas CPPsite 2.0 provides qualitative data.
POSEIDON covers all types of CPPs, including those composed of L-amino acids, D-amino acids, a mix of L- and D-amino acids, and non-natural amino acids (Fig. 2A). The composition of CPPs revealed that certain residues, such as arginine, lysine, and leucine, were more prominent in CPPs than methionine, aspartate, tyrosine, and asparagine residues, which were not enriched in CPPs (Fig. 2C). Positively charged residues in POSEIDON, such as arginine and lysine, interact with negatively charged cell membrane components, increasing cellular uptake, as shown in Fig. 2B. The amphiphilic nature of CPPs, owing to their cationic and hydrophobic residues, enhances their interactions with the cell membrane and improves cell penetration [38] or cargo interaction [39].
This database provides peptide sequences that facilitate the retrieval of physicochemical properties that can be directly calculated from their primary sequences. Our dataset contained a significant number of peptides shorter than 10 amino acids (n = 821) and between 11 and 20 amino acids (n = 1,029), as shown in Fig. 3A. Most CPPs exhibit molecular weights ranging from 1 to 1.5 kDa. Both charge distribution and peptide length enable CPPs to interact with various cell-surface molecules, significantly influencing the selection of an entry pathway [40]. Among several influencing factors, such as the physicochemical properties of the peptide and its cargo, the internalization routes of CPPs are primarily directed towards two major pathways: endocytosis (an active or energy-dependent process) and membrane translocation (a direct or passive energy-independent process) [41]. Therefore, we analyzed the distribution of [...] The diversity of cell lines ensures that CPP/cell line combinations can be analyzed using this database.
Scientific studies have shown that CPPs are associated with various cargoes, ranging from fluorophores to nucleic acids. Thus, the cargoes associated with each peptide are available in POSEIDON. As expected, our dataset demonstrated that fluorescein isothiocyanate (FITC), fluorescein, and carboxyfluorescein were the cargoes most strongly associated with CPPs (Fig. 4A). As shown in Fig. 4B, most CPPs in the dataset were associated with fluorophores (n = 4,368), followed by small ligands (n = 795), nanoparticles (n = 633), proteins (n = 600), and nucleic acids (n = 471).
Flow cytometry was the most commonly employed method for uptake evaluation in this dataset, accounting for 1,349 entries, whereas fluorescence microscopy, fluorescence spectroscopy, and Fluorescence-Activated Cell Sorting (FACS) were employed for 289, 247, and 155 entries, respectively (Fig. 4C). However, as shown in Fig. 4D, there was a high degree of variability in the uptake units, and several studies used slightly different designations for identical uptake units.
After standardizing identical units to a unique designation, the mean fluorescence intensity was the most frequently employed unit in this dataset, with 481 entries. The different units presented in Fig. 4D highlight the lack of standardization in CPP uptake evaluations conducted in previous studies, which hinders the comparison and analysis of CPP uptake data. Although there are currently no standardized methods for CPP uptake evaluation, flow cytometry has been employed significantly more frequently than the other methods. This suggests that it is possible to establish a general method using specific, easily attainable controls, allowing a large amount of quantitative data to be acquired and compared more adequately and easily. This database also provides information on the temperature and time of CPP incubation. Due to the nature of CPPs and their internalization mechanisms, changes in certain conditions, such as temperature, can significantly impact the uptake of CPPs by cells, often due to alterations in the underlying mechanism [42-44]. Thus, these data are highly valuable for the development of new approaches.

Processed database description
The POSEIDON database uptake-prediction methods developed in this study rely exclusively on fluorescence measurements. This approach was selected because other methods can produce inconsistent results, leading to discrepancies in the derived uptake units. Therefore, to establish a reliable benchmark dataset, we selected CPPs that were evaluated using fluorescence methods, resulting in a dataset of 1,274 entries. After removing outliers, the final dataset contained 1,263 entries.
As shown in red in the figures, most amino acids were L-amino acids (Fig. 2A) and were mainly hydrophobic and polar charged (Fig. 2B). Similar to the raw dataset, arginine, lysine, and leucine were present in large numbers in the CPP sequences, in contrast to methionine, aspartate, asparagine, and tyrosine residues, which were not prominent in CPPs (Fig. 2C).
The benchmark dataset included CPP sequences of various sizes, with sequences consisting of 11-20 residues being the most common (n = 619), followed by sequences with fewer than 10 residues (n = 316), and sequences consisting of 21-30 residues (n = 265) (Fig. 3A, red). In terms of cell lines, HeLa cells were the most frequently used, as in the raw dataset. However, the benchmark dataset showed the emergence of HepG2, Jurkat, and bEnd.3 cell lines as among the most frequently used cell lines for CPPs. Regarding cargo, the benchmark dataset showed a slightly different trend than the raw dataset, with Dil, rhodamine (Rho), small interfering RNA (siRNA), and TAM being highly associated with CPPs. Fluorophores were the most common cargo (n = 1,249), followed by nanoparticles (n = 198), small ligands (n = 165), nucleic acids (n = 110), and proteins (n = 56) (Fig. 4B, red).
Additional information emerges from a correlation analysis between the features and the processed target variable. Among the 30 features that exhibited the highest Pearson correlation with the target variable (Additional file 1: Table S4), half were position-encoding features and one-third were genomic features. Only two whole-peptide features were present in the top 30, whereas cargo accounted for three. Although experimental features such as concentration and temperature were not included in the top 30, they are among the top 100, as shown in the additional figures on the website.
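A feature ranking of this kind can be reproduced with a short sketch. The version below is illustrative only: the feature names and values are invented toy data, ranking by the absolute value of the Pearson correlation is an assumption, and constant features are given a correlation of zero by convention.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def top_correlated(features, target, k=30):
    """Return the k (name, correlation) pairs with the largest absolute correlation."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)[:k]


# Hypothetical feature columns and log10 uptake values for five entries.
features = {
    "pos1_R": [1, 0, 1, 0, 1],        # one-hot: arginine at position 1
    "conc_uM": [5, 5, 10, 10, 20],    # peptide concentration
    "const": [1, 1, 1, 1, 1],         # constant feature, no information
}
log_uptake = [2.0, 1.0, 2.5, 1.2, 3.0]

ranking = top_correlated(features, log_uptake)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

A univariate ranking such as this one only reflects linear association between a single feature and the target, so it should be read as descriptive rather than as a measure of feature importance inside a trained model.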

Performance of the different predictors
After implementing the hyperparameter optimization pipeline (Table 3), the best-performing models were XGB and DNN, as indicated by their evaluation metrics on the independent test set that did not participate in either training or hyperparameter optimization (Table 4). Specifically, both models achieved high r² scores, exceeding 0.76, whereas the other methods barely surpassed the 0.70 threshold. Furthermore, they exhibited high correlation metrics, with Pearson correlations above 0.87 and Spearman correlations above 0.88. Consequently, the final prediction pipeline of POSEIDON displays predictions generated by both DNN and XGB models.
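For readers who want to compute the same kind of evaluation metrics on their own predictions, the following sketch gives plain-Python definitions of MSE, RMSE, MAE, Pearson correlation, Spearman correlation, and the coefficient of determination. The function names and the toy values are illustrative assumptions, not the evaluation code used for POSEIDON.

```python
import math


def mean(values):
    return sum(values) / len(values)


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def pearson(x, y):
    """Pearson correlation (assumes both inputs have non-zero variance)."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    center = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - center) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy log10 uptake values and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Note that the coefficient of determination defined here penalizes systematic offsets in the predictions, so it is not in general equal to the square of the Pearson correlation.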

Discussion
CPPs have great potential in therapy and diagnosis; however, identifying new and efficient CPPs can be costly and time-consuming. Consequently, computational biology studies have become increasingly important in this field, although they have mainly focused on the qualitative features of CPPs. POSEIDON addresses this gap by offering a novel up-to-date database that includes quantitative experimental uptake efficiency data and serves as a benchmark for the field. The POSEIDON database and prediction pipeline have provided several important insights into the rapidly evolving field of CPP research. First, it is evident that effective CPPs are characterized by an abundance of positively charged amino acids, which is biochemically logical because it allows peptides to leverage the electrostatic differences inside and outside the cell, thereby augmenting cellular internalization. Indeed, the internalization mechanism of CPPs remains a subject of ongoing debate, with CPP concentration, charge, and amphipathicity emerging as crucial factors. The intricate processes governing CPP internalization involve a combination of endocytic and direct translocation mechanisms [41]. The positive charge, particularly from arginine residues, significantly influenced CPP uptake, with arginine being more favorable for delivery and CPP activity than lysine. Amphipathic peptides can directly penetrate the cell membrane at low concentrations, whereas non-amphipathic CPPs rely on endocytosis [6]. Regarding CPP concentration, endocytosis is typically the predominant mechanism under physiological conditions and at low peptide concentrations. In contrast, at higher peptide concentrations, direct translocation across the plasma membrane becomes more prevalent [41]. Further investigation of the specific mechanisms employed by CPPs with different physicochemical properties and concentrations will provide valuable insights into the complex dynamics governing cellular uptake.
Second, fluorophores significantly affect CPP activity: their presence is methodologically required, and they are highly correlated with the uptake variable, implying that they may intervene in molecular interactions. Moreover, the presence of cargo can modify the CPP uptake pathway, as demonstrated by the observed impact of cargo size and binding methodology on the CPP translocation mechanism [41,45].
Third, genomic descriptors play a crucial role in this process, which was not adequately addressed before POSEIDON. Notably, mutation of the NRAS gene, which is linked to cell division in cancer, was found to be the variable most correlated with CPP uptake, followed closely by mutation of IDH1, which encodes isocitrate dehydrogenase 1, a key player in the Krebs cycle. Exploring the biological relationship between these genes (and several others that rank highly) and CPPs might be a worthy endeavor.
Fourth, CPP penetration into cells is influenced by the cell line owing to differences in membrane composition, receptor expression, and intracellular mechanisms. These factors affect the effectiveness and penetration mechanism of CPPs. Understanding CPP behavior in specific cell lines is crucial for accurate results because the findings may not apply universally: studies on various cell lines reveal cell-dependent preferences for specific CPPs [41]. This also supports targeted CPP application in various biological and therapeutic contexts.
The POSEIDON database is not only the largest but also a comprehensive, curated database with CPP information. The inclusion of an extensive range of experimental characteristics in our dataset underscores the complexity inherent in CPP behavior. The prediction method employed by POSEIDON is unique in that it treats CPP uptake activity as a continuous variable, unlike previous efforts that only featured categorical predictions. Our approach also includes multiple previously unused sources of information, which allow users to test sequence anomalies, select tissue-specific cell lines, choose up to two cargoes per peptide, and adjust experimental conditions, such as temperature, concentration, and incubation time. We ensured that the algorithm incorporated all relevant parameters.
Assessing the POSEIDON ML approach in comparison with other prediction methods poses a distinct challenge, mainly because of the limited availability of similar approaches. Dowaidar et al. are an exception, as they spearheaded the creation of Fragment Quantitative Structure-Activity Relationship (FQSAR) models [46]. These models were specifically tailored to forecast the biological activity of CPPs in peptide-based transfection systems (PBTS) and were trained on only 11 data points, yet achieved r² values ranging from 0.906 to 0.961 across various models. Nevertheless, POSEIDON stands out with very high correlation metrics and low errors, demonstrating its ability to predict CPP uptake under different conditions.

Conclusion
POSEIDON is the first database to compile quantitative data on cellular uptake together with methodology, units, and experimental conditions. The POSEIDON database, a recently launched, open-source, and comprehensive resource, focuses exclusively on curated CPPs with quantitative uptake values. Each CPP in the database is accompanied by physicochemical properties, cell line, cargo, sequence, uptake evaluation method, concentration, temperature, and incubation time. The POSEIDON predictor is also the first tool to predict CPP uptake based on quantitative uptake and genomic data. With its dynamic, free, and easy-to-use interface, users can submit a peptide sequence and obtain computational predictions of its uptake in various cell lines. Additionally, users can customize properties, such as peptide concentration, incubation time, temperature, and cell line type. The POSEIDON database is a unique resource for researchers to develop new methodologies and predictors for CPP sequence design based on uptake values.

Fig. 2 Representation of peptide composition in the POSEIDON database (raw data in blue and benchmark data in red) based on A chirality/modifications of CPPs, B the type of amino acid, and C quantification of the amino acid composition of CPPs. The data pertain to peptides without non-natural amino acids.

Fig. 3 CPP features in both datasets (raw data in blue and benchmark data in red). A Length of peptide sequences in the database. B The 10 most used cell lines according to the dataset.

Table 1 Summary of the POSEIDON features used for ML

Table 2
work (fNN). While most of these models are standard imports from their respective packages, the fNN was designed for these purposes, comprising a neural network with different points of entry for each feature block type. All models were parameterized using the training set and an independent testing set. In this study, we evaluated the performance of our regression ML models using several metrics, including Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Pearson correlation, Spearman correlation, and coefficient of determination (r²).

Table 4
Results for the best performance of each optimized ML model
Incorporating all relevant parameters enables the algorithm to capture intricate and nonlinear relationships among the variables. This approach enhances the predictive capacity of the model, making it adept at handling multifaceted experimental conditions encountered in various studies.